In this project, I worked through the process of one-hot encoding categorical features, fitting a linear model to the tips dataset both numerically (with scipy.optimize.minimize) and analytically (the least squares solution), and then using gradient descent to optimize a nonlinear, sinusoidal model.
By: Victor Cornejo
To begin, let's load the tips dataset from the seaborn library. This dataset contains records of tips, total bills, and information about the person who paid the bill. As before, we'll be trying to predict the tip from the other columns.
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
import matplotlib.pyplot as plt
np.random.seed(42)
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")
%matplotlib inline
data = sns.load_dataset("tips")
print("Number of Records:", len(data))
data.head()
Number of Records: 244
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Moving on from the constant model, it was time for a more complicated model that uses the other features in the dataset. I imagined using the features in an equation like the one below:
$$ \text{Tip} = \theta_1 \cdot \text{total}\_\text{bill} + \theta_2 \cdot \text{sex} + \theta_3 \cdot \text{smoker} + \theta_4 \cdot \text{day} + \theta_5 \cdot \text{time} + \theta_6 \cdot \text{size} $$

Unfortunately, that's not directly possible, because some of these features, like "day", are not numbers, so it doesn't make sense to multiply them by a numerical parameter.
Therefore, I started by converting some of these non-numerical values into numerical values. Before I did this, I separated out the tips and the features into two separate variables.
tips = data['tip']
X = data.drop(columns='tip')
First, I needed to convert the features to numerical values. A straightforward approach would be to map each non-numerical feature directly to a number.
For example, I could treat the day as a value from 1 to 7. The disadvantage of translating categories directly to numbers is that it unintentionally gives some of them disproportionate weight: if Sunday is assigned the value 7 and Monday the value 1, then in a linear model Sunday has seven times the influence of Monday, which can lower the accuracy of the model.
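As an illustration only (this mapping is not used anywhere else in the project), the naive numeric encoding of the day column might look like this:
# Naive ordinal encoding of the day column (for illustration; not used in the final model).
day_as_number = X["day"].map({"Thur": 4, "Fri": 5, "Sat": 6, "Sun": 7})
day_as_number.head()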
Instead, I used one-hot encoding to better represent these features.
In the tips dataset, for example, I encoded Sunday as the vector [0 0 0 1], because the dataset only contains bills from Thursday through Sunday. This assigns a more even weight across the categories of each non-numeric feature. The call below one-hot encodes the day column; the resulting dataframe holds the "featurized" data, which is also often denoted by $\phi$.
pd.get_dummies(X["day"]).head()
| | Thur | Fri | Sat | Sun |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 |
def one_hot_encode(data):
    """
    Return the one-hot encoded dataframe of our input data.

    Parameters
    -----------
    data: a dataframe that may include non-numerical features

    Returns
    -----------
    A one-hot encoded dataframe that only contains numeric features
    """
    data = pd.get_dummies(data)
    return data
one_hot_X = one_hot_encode(X)
one_hot_X.head()
| | total_bill | size | sex_Male | sex_Female | smoker_Yes | smoker_No | day_Thur | day_Fri | day_Sat | day_Sun | time_Lunch | time_Dinner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 10.34 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 21.01 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 23.68 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 24.59 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
Once all of the data was numeric, I could begin to define the model function. Notice that after one-hot encoding the data, I now had 12 features instead of 6. Therefore, the linear model looked like:
$$ \text{Tip} = \theta_1 \cdot \text{size} + \theta_2 \cdot \text{total}\_\text{bill} + \theta_3 \cdot \text{day}\_\text{Thur} + \theta_4 \cdot \text{day}\_\text{Fri} + ... + \theta_{11} \cdot \text{time}\_\text{Lunch} + \theta_{12} \cdot \text{time}\_\text{Dinner} $$

I can represent the linear combination above as a matrix-vector product, which the model's predict method below evaluates.
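Concretely, stacking the one-hot encoded features into an $n \times 12$ matrix $\mathbb{X}$ and the parameters into a vector $\boldsymbol{\theta} \in \mathbb{R}^{12}$, the predictions are:
$$\hat{\textbf{y}} = \mathbb{X}\boldsymbol{\theta}$$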
Below, I created a MyLinearModel class with two methods, predict and fit. For now, fit is just a placeholder: it sets all 12 of the model's parameters to zero, so the fitted model doesn't do anything useful yet.
class MyLinearModel():
    def predict(self, X):
        return X @ self._thetas

    def fit(self, X, y):
        number_of_features = X.shape[1]
        self._thetas = np.zeros(shape = (number_of_features, 1))
model = MyLinearModel()
model.fit(one_hot_X, tips)
model._thetas
array([[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.]])
Using the scipy.optimize.minimize function, I adapted the code below to implement the fit method of the linear model.
I added a loss_function parameter so that the model can be fit with any desired loss function, not necessarily the L2 loss. Example loss functions are given as l1 and l2.
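In other words, fit searches for the parameter vector that minimizes the average loss of the predictions $\mathbb{X}\boldsymbol{\theta}$ over the training data:
$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, (\mathbb{X}\boldsymbol{\theta})_i\big)$$
where $\ell$ is either the absolute (l1) or squared (l2) loss.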
X.shape[1]
6
from scipy.optimize import minimize
def l1(y, y_hat):
    return np.abs(y - y_hat)

def l2(y, y_hat):
    return (y - y_hat)**2

class MyLinearModel():
    def predict(self, X):
        return X @ self._thetas

    def fit(self, loss_function, X, y):
        """
        Produce the estimated optimal _thetas for the given loss function,
        feature matrix X, and observations y.

        Parameters
        -----------
        loss_function: either the squared or absolute loss functions defined above
        X: a 2D dataframe (or numpy array) of numeric features (one-hot encoded)
        y: a 1D vector of tip amounts

        Returns
        -----------
        The estimate for the optimal theta vector that minimizes our loss
        """
        number_of_features = X.shape[1]

        # The starting guess is an arbitrary vector of the correct length (here, all zeros).
        starting_guess = np.zeros(shape = (number_of_features, 1))

        # Minimize the average loss of the predictions X @ theta, measured with the
        # given loss function against the observations y.
        self._thetas = minimize(lambda theta:
                                np.mean(loss_function(y, X @ theta)),
                                x0 = starting_guess)['x']
model = MyLinearModel()
model.fit(l2, one_hot_X, tips)
model._thetas
/tmp/ipykernel_106/482866692.py:39: DeprecationWarning: Use of `minimize` with `x0.ndim != 1` is deprecated. Currently, singleton dimensions will be removed from `x0`, but an error will be raised in SciPy 1.11.0. self._thetas = minimize(lambda theta:
array([0.094487 , 0.1759912 , 0.18411073, 0.21655221, 0.15712765,
0.24353564, 0.0151997 , 0.17746083, 0.05600925, 0.15199322,
0.23440138, 0.16626186])
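As an aside, the DeprecationWarning above comes from passing a column vector of shape (number_of_features, 1) as x0; scipy.optimize.minimize expects a 1D starting guess. I did not rerun the cell with this change, but replacing the starting guess inside fit would silence the warning:
# Use a 1D starting guess so that x0.ndim == 1 inside fit.
starting_guess = np.zeros(number_of_features)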
The MSE for the model above was just slightly larger than 1:
from sklearn.metrics import mean_squared_error
mean_squared_error(model.predict(one_hot_X), tips)
1.010353561236632
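Taking the square root puts this error on the same scale as the tips themselves: $\sqrt{1.0104} \approx 1.005$, so the predictions are off by roughly a dollar per bill in the root-mean-squared sense.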
Next, I fit the model analytically for the L2 loss. Recall that with a linear model, least squares solves the following optimization problem:
$$\min_{\boldsymbol{\theta}} ||\mathbb{X}\boldsymbol{\theta} - \mathbb{y}||^2$$

When $\mathbb{X}^T\mathbb{X}$ is invertible, the optimal parameters are given by $\hat{\boldsymbol{\theta}} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{y}$.
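This follows from setting the gradient of the squared error to zero:
$$\nabla_{\boldsymbol{\theta}} \, ||\mathbb{X}\boldsymbol{\theta} - \mathbb{y}||^2 = 2\mathbb{X}^T(\mathbb{X}\boldsymbol{\theta} - \mathbb{y}) = 0 \quad\Longrightarrow\quad \mathbb{X}^T\mathbb{X}\,\hat{\boldsymbol{\theta}} = \mathbb{X}^T\mathbb{y}$$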
For this part, I implemented the analytic solution above using np.linalg.inv to compute the inverse of $X^TX$.
class MyAnalyticallyFitOLSModel():
    def predict(self, X):
        return X @ self._thetas

    def fit(self, X, y):
        """
        Sets _thetas using the analytical solution to the ordinary least squares problem

        Parameters
        -----------
        X: a 2D dataframe (or numpy array) of numeric features (one-hot encoded)
        y: a 1D vector of tip amounts

        Returns
        -----------
        The estimate for theta computed using the equation mentioned above
        """
        xTx = X.T @ X # SOLUTIONx
        xTy = X.T @ y # SOLUTION
        self._thetas = np.linalg.inv(xTx) @ xTy # SOLUTION
The cell below applies the analytical solution to the tips dataset. It either raises a singular matrix error or produces nonsensical thetas (magnitudes greater than $10^{15}$). This is no bueno...
# When you run the code below, you should get back some non zero thetas.
model = MyAnalyticallyFitOLSModel()
model.fit(one_hot_X, tips)
analytical_thetas = model._thetas
analytical_thetas
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
Cell In[15], line 4
      1 # When you run the code below, you should get back some non zero thetas.
      3 model = MyAnalyticallyFitOLSModel()
----> 4 model.fit(one_hot_X, tips)
      5 analytical_thetas = model._thetas
      6 analytical_thetas

Cell In[14], line 21, in MyAnalyticallyFitOLSModel.fit(self, X, y)
     19 xTx = X.T @ X # SOLUTIONx
     20 xTy = X.T @ y # SOLUTION
---> 21 self._thetas = np.linalg.inv(xTx) @ xTy

File <__array_function__ internals>:180, in inv(*args, **kwargs)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/numpy/linalg/linalg.py:552, in inv(a)
    550 signature = 'D->D' if isComplexType(t) else 'd->d'
    551 extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 552 ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    553 return wrap(ainv.astype(result_t, copy=False))

File /srv/conda/envs/notebook/lib/python3.9/site-packages/numpy/linalg/linalg.py:89, in _raise_linalgerror_singular(err, flag)
     88 def _raise_linalgerror_singular(err, flag):
---> 89 raise LinAlgError("Singular matrix")

LinAlgError: Singular matrix
I got the error above when trying to calculate the analytical solution for the one-hot encoded tips dataset because the matrix $\mathbb{X}^T\mathbb{X}$ was not invertible. The fully one-hot encoded features are redundant: sex_Male and sex_Female always sum to 1, and the same is true of the smoker, day, and time dummy columns, so the columns of $\mathbb{X}$ are linearly dependent and $\mathbb{X}^T\mathbb{X}$ is singular.
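A quick sanity check (not part of the original output) confirms the rank deficiency: the featurized matrix has fewer independent columns than total columns.
# The fully one-hot encoded design matrix is rank-deficient: its rank is strictly
# less than its 12 columns, so X^T X cannot be inverted.
np.linalg.matrix_rank(one_hot_X.to_numpy(dtype=float)), one_hot_X.shape[1]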
To fix the one-hot encoding so the error above no longer occurs, I used the code below to one-hot encode the dataset such that one_hot_X_revised has no redundant features: dropping the first level of each categorical column (drop_first=True) removes the redundancy.
X.head(5)
| | total_bill | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 0 | 16.99 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | Female | No | Sun | Dinner | 4 |
pd.get_dummies(X)
| | total_bill | size | sex_Male | sex_Female | smoker_Yes | smoker_No | day_Thur | day_Fri | day_Sat | day_Sun | time_Lunch | time_Dinner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 10.34 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 21.01 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 23.68 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 24.59 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 240 | 27.18 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 241 | 22.67 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 242 | 17.82 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 243 | 18.78 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
244 rows × 12 columns
pd.get_dummies(X, drop_first = True)
| | total_bill | size | sex_Female | smoker_No | day_Fri | day_Sat | day_Sun | time_Dinner |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 2 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 10.34 | 3 | 0 | 1 | 0 | 0 | 1 | 1 |
| 2 | 21.01 | 3 | 0 | 1 | 0 | 0 | 1 | 1 |
| 3 | 23.68 | 2 | 0 | 1 | 0 | 0 | 1 | 1 |
| 4 | 24.59 | 4 | 1 | 1 | 0 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 3 | 0 | 1 | 0 | 1 | 0 | 1 |
| 240 | 27.18 | 2 | 1 | 0 | 0 | 1 | 0 | 1 |
| 241 | 22.67 | 2 | 0 | 0 | 0 | 1 | 0 | 1 |
| 242 | 17.82 | 2 | 0 | 1 | 0 | 1 | 0 | 1 |
| 243 | 18.78 | 2 | 1 | 1 | 0 | 0 | 0 | 1 |
244 rows × 8 columns
def one_hot_encode_revised(data):
    """
    Return the one-hot encoded dataframe of our input data, removing redundancies.

    Parameters
    -----------
    data: a dataframe that may include non-numerical features

    Returns
    -----------
    A one-hot encoded dataframe that only contains numeric features without any redundancies.
    """
    one_hot_data = pd.get_dummies(data, drop_first = True)
    return one_hot_data
one_hot_X_revised = one_hot_encode_revised(X)
numerical_model = MyLinearModel()
numerical_model.fit(l2, one_hot_X_revised, tips)
analytical_model = MyAnalyticallyFitOLSModel()
analytical_model.fit(one_hot_X_revised, tips)
print("The numerical model's loss is: ", mean_squared_error(numerical_model.predict(one_hot_X_revised), tips))
print("The analytical model's loss is: ", mean_squared_error(analytical_model.predict(one_hot_X_revised), tips))
The numerical model's loss is:  1.0332855709135256
The analytical model's loss is:  1.033285570878662
/tmp/ipykernel_106/482866692.py:39: DeprecationWarning: Use of `minimize` with `x0.ndim != 1` is deprecated. Currently, singleton dimensions will be removed from `x0`, but an error will be raised in SciPy 1.11.0. self._thetas = minimize(lambda theta:
By removing the redundant features, the matrix $\mathbb{X}^T\mathbb{X}$ became invertible, and the numerical and analytical solutions now agree to about eight decimal places.
With the linear model settled, I moved on to a second dataset to explore gradient descent on a nonlinear model.
# Loading the data
df = pd.read_csv("lab7_data.csv", index_col=0)
df.head()
| | x | y |
|---|---|---|
| 0 | -5.000000 | -7.672309 |
| 1 | -4.966555 | -7.779735 |
| 2 | -4.933110 | -7.995938 |
| 3 | -4.899666 | -8.197059 |
| 4 | -4.866221 | -8.183883 |
Plotting the data, I could see that there was a clear sinusoidal relationship between x and y.
import plotly.express as px
px.scatter(df, x = "x", y = "y")
The relationship is clearly not linear. However, gradient descent is powerful enough to optimize even a nonlinear model. Specifically, I modeled the relationship in the data by:
$$\Large f_{\boldsymbol{\theta}}(x) = \theta_1 x + \sin(\theta_2 x)$$

My model is parameterized by both $\theta_1$ and $\theta_2$, which can be represented in the vector $\boldsymbol{\theta}$.
Note that a general sine function $a\sin(bx+c)$ has three parameters: amplitude scaling parameter $a$, frequency parameter $b$ and phase shifting parameter $c$.
Here, I assumed the amplitude $a$ was around 1, and the phase shifting parameter $c$ was around zero. I did not attempt to justify this assumption.
I defined the sin_model function below that predicts $\textbf{y}$ (the $y$-values) using $\textbf{x}$ (the $x$-values) based on the new equation.
def sin_model(x, theta):
    """
    Predict the estimate of y given x, theta_1, theta_2

    Keyword arguments:
    x -- the vector of values x
    theta -- a vector of length 2, where theta[0] = theta_1 and theta[1] = theta_2
    """
    theta_1 = theta[0]
    theta_2 = theta[1]
    return theta_1 * x + np.sin(theta_2 * x)
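As a quick illustration (these theta values are arbitrary and not part of the analysis), the model can be evaluated directly on a few points:
# With theta = [1.5, 3.0], sin_model predicts 1.5 * x + sin(3.0 * x) for each x.
example_theta = np.array([1.5, 3.0])
sin_model(np.array([0.0, 1.0, 2.0]), example_theta)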
$\hat{\boldsymbol{\theta}}$ is the value of $\boldsymbol{\theta}$ that minimizes the loss function. One way of solving for $\hat{\boldsymbol{\theta}}$ is gradient descent, which requires the partial derivatives of the loss with respect to $\theta_1$ and $\theta_2$. As before, the loss is the mean squared error:
$$L(\textbf{x}, \textbf{y}, \theta_1, \theta_2) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The functions sin_MSE, sin_MSE_dt1 and sin_MSE_dt2 below compute $L$, $\frac{\partial L}{\partial \theta_1}$ and $\frac{\partial L}{\partial \theta_2}$, respectively. In each function, the parameter theta is a vector $\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}$. Finally, sin_MSE_gradient calls sin_MSE_dt1 and sin_MSE_dt2 and stacks the results into the gradient vector.
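For reference, the partial derivatives implemented below, with $\hat{y}_i = \theta_1 x_i + \sin(\theta_2 x_i)$, are:
$$\frac{\partial L}{\partial \theta_1} = \frac{1}{n} \sum_{i=1}^{n} -2 x_i \big(y_i - \hat{y}_i\big), \qquad \frac{\partial L}{\partial \theta_2} = \frac{1}{n} \sum_{i=1}^{n} -2 x_i \cos(\theta_2 x_i) \big(y_i - \hat{y}_i\big)$$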
def sin_MSE(theta, x, y):
    """
    Compute the numerical value of the l2 loss of our sinusoidal model given theta

    Keyword arguments:
    theta -- the vector of values theta
    x -- the vector of x values
    y -- the vector of y values
    """
    return np.mean((y - sin_model(x, theta)) ** 2)

def sin_MSE_dt1(theta, x, y):
    """
    Compute the numerical value of the partial of l2 loss with respect to theta_1

    Keyword arguments:
    theta -- the vector of values theta
    x -- the vector of x values
    y -- the vector of y values
    """
    return np.mean(-2 * x * (y - sin_model(x, theta)))

def sin_MSE_dt2(theta, x, y):
    """
    Compute the numerical value of the partial of l2 loss with respect to theta_2

    Keyword arguments:
    theta -- the vector of values theta
    x -- the vector of x values
    y -- the vector of y values
    """
    return np.mean(-2 * x * np.cos(theta[1] * x) * (y - sin_model(x, theta)))
# This function calls dt1 and dt2 and returns the gradient dt.
def sin_MSE_gradient(theta, x, y):
    """
    Returns the gradient of l2 loss with respect to vector theta

    Keyword arguments:
    theta -- the vector of values theta
    x -- the vector of x values
    y -- the vector of y values
    """
    return np.array([sin_MSE_dt1(theta, x, y), sin_MSE_dt2(theta, x, y)])
def init_theta():
    """Creates an initial theta [0, 0] of shape (2,) as a starting point for gradient descent"""
    return np.array([0, 0])
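The grad_desc function below repeatedly applies the standard gradient descent update with a fixed learning rate $\alpha$, which is exactly the last line inside its loop:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \, \nabla_{\boldsymbol{\theta}} L\big(\boldsymbol{\theta}^{(t)}\big)$$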
def grad_desc(loss_f, gradient_loss_f, theta, data, num_iter=20, alpha=0.1):
    """
    Run gradient descent update for a finite number of iterations and static learning rate

    Keyword arguments:
    loss_f -- the loss function to be minimized (used for computing loss_history)
    gradient_loss_f -- the gradient of the loss function to be minimized
    theta -- the vector of values theta to use at first iteration
    data -- the data used in the model
    num_iter -- the max number of iterations
    alpha -- the learning rate (also called the step size)

    Return:
    theta -- the optimal value of theta after num_iter of gradient descent
    theta_history -- the series of theta values over each iteration of gradient descent
    loss_history -- the series of loss values over each iteration of gradient descent
    """
    theta_history = []
    loss_history = []
    for i in range(num_iter):
        theta_history.append(theta)
        loss_history.append(loss_f(theta, data['x'], data['y']))
        theta = theta - alpha * gradient_loss_f(theta, data['x'], data['y'])
    return theta, theta_history, loss_history
theta_start = init_theta()
theta_hat, thetas_used, losses_calculated = grad_desc(
    sin_MSE, sin_MSE_gradient, theta_start, df, num_iter=20, alpha=0.1
)
for b, l in zip(thetas_used, losses_calculated):
    print(f"theta: {b}, Loss: {l}")
theta: [0 0], Loss: 20.859191416422235
theta: [2.60105745 2.60105745], Loss: 9.285008173048666
theta: [0.90342728 2.59100602], Loss: 4.680169273815357
theta: [2.05633644 2.9631291 ], Loss: 2.6242517936325833
theta: [1.15892347 2.86687431], Loss: 1.4765157174727774
theta: [1.79388042 3.07275573], Loss: 0.9073271435862448
theta: [1.32157494 3.00146569], Loss: 0.541531643291128
theta: [1.64954491 3.02910866], Loss: 0.3775841142469479
theta: [1.42325294 2.98820303], Loss: 0.2969750688130759
theta: [1.58295041 3.01033846], Loss: 0.2590425421375732
theta: [1.47097255 2.98926519], Loss: 0.23973439443291833
theta: [1.55040965 3.0017442 ], Loss: 0.23034782416254634
theta: [1.49439132 2.99135194], Loss: 0.2255775832667724
theta: [1.5341564  2.99797824], Loss: 0.22321772191904068
theta: [1.50603995 2.99286671], Loss: 0.22202363967204045
theta: [1.52598919 2.99628665], Loss: 0.22142811500262397
theta: [1.51186655 2.99375531], Loss: 0.22112776381775168
theta: [1.52188208 2.99549617], Loss: 0.22097741373654575
theta: [1.51478773 2.99423497], Loss: 0.22090173185683037
theta: [1.51981739 2.99511516], Loss: 0.2208637810584589
Now, I visually inspected the results of running gradient descent to optimize $\boldsymbol{\theta}$. The code below plots the model's predicted $\hat{y}$-values against the $x$-values, on top of the original scatter plot, showing that gradient descent successfully optimized $\boldsymbol{\theta}$.
theta_init = init_theta()
theta_est, thetas, loss = grad_desc(sin_MSE, sin_MSE_gradient, theta_init, df)
Plotting the model output over the observations shows that gradient descent did a great job finding both the overall increase (slope) of the data, as well as the oscillation frequency.
x, y = df['x'], df['y']
y_pred = sin_model(x, theta_est)
plt.plot(x, y_pred, label='Model ($\hat{y}$)')
plt.scatter(x, y, alpha=0.5, label='Observation ($y$)', color='gold')
plt.legend();
I also visualized the loss surface to gain some insight into how gradient descent optimizes the model parameters. The printed losses above decreased with each iteration; in this part, I display the trajectory the algorithm takes as it travels across the loss surface.
thetas = np.array(thetas).squeeze()
loss = np.array(loss)
thetas
array([[0. , 0. ],
[2.60105745, 2.60105745],
[0.90342728, 2.59100602],
[2.05633644, 2.9631291 ],
[1.15892347, 2.86687431],
[1.79388042, 3.07275573],
[1.32157494, 3.00146569],
[1.64954491, 3.02910866],
[1.42325294, 2.98820303],
[1.58295041, 3.01033846],
[1.47097255, 2.98926519],
[1.55040965, 3.0017442 ],
[1.49439132, 2.99135194],
[1.5341564 , 2.99797824],
[1.50603995, 2.99286671],
[1.52598919, 2.99628665],
[1.51186655, 2.99375531],
[1.52188208, 2.99549617],
[1.51478773, 2.99423497],
[1.51981739, 2.99511516]])
from lab7_utils import plot_3d
plot_3d(thetas[:, 0], thetas[:, 1], loss, mean_squared_error, sin_model, x, y)
import plotly
import plotly.graph_objs as go
def contour_plot(title, theta_history, loss_function, model, x, y):
    """
    The function takes the following as arguments:
        title: the plot title
        theta_history: a (N, 2) array of theta history
        loss_function: for example, l2_loss
        model: for example, sin_model
        x: the original x input
        y: the original y output
    """
    theta_1_series = theta_history[:,0] # a list or array of theta_1 values
    theta_2_series = theta_history[:,1] # a list or array of theta_2 values

    ## In the following block of code, I generate the z values
    ## across a 2D grid
    theta1_s = np.linspace(np.min(theta_1_series) - 0.1, np.max(theta_1_series) + 0.1)
    theta2_s = np.linspace(np.min(theta_2_series) - 0.1, np.max(theta_2_series) + 0.1)

    x_s, y_s = np.meshgrid(theta1_s, theta2_s)
    data = np.stack([x_s.flatten(), y_s.flatten()]).T
    ls = []
    for theta1, theta2 in data:
        l = loss_function(model(x, np.array([theta1, theta2])), y)
        ls.append(l)
    z = np.array(ls).reshape(50, 50)

    # Create the trace of theta points and the loss contour
    theta_points = go.Scatter(name="theta Values",
                              x=theta_1_series,
                              y=theta_2_series,
                              mode="lines+markers")
    lr_loss_contours = go.Contour(x=theta1_s,
                                  y=theta2_s,
                                  z=z,
                                  colorscale='Viridis', reversescale=True)

    plotly.offline.iplot(go.Figure(data=[lr_loss_contours, theta_points], layout={'title': title}))
contour_plot('Gradient Descent with Static Learning Rate', thetas, mean_squared_error, sin_model, df["x"], df["y"])
As we can see, gradient descent is able to navigate even this fairly complex loss space and find a nice minimum.